
 value function estimate



Belief-Dependent Macro-Action Discovery in POMDPs using the Value of Information

Neural Information Processing Systems

This property can be observed directly from Eq. 2 when integration is replaced by summation. Closed-loop: First, we construct the standard, closed-loop α-vectors, which represent the value function under closed-loop dynamics [1, 5]. Each point in the scatter plot represents a paired experiment with identical target dynamics.
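Where the excerpt mentions constructing the standard closed-loop α-vectors, the idea is that the value function is represented as V(b) ≈ max_α α·b over a set of vectors produced by point-based backups. The following is a minimal illustrative sketch of such a backup, not the paper's implementation; the tensors T, O, R and the discount gamma are assumed inputs.

```python
import numpy as np

def backup_alpha(b, alpha_set, T, O, R, gamma):
    """One point-based backup at belief b, returning a closed-loop alpha-vector
    so that V(b) ~= max over the alpha set of (alpha . b).

    Assumed (illustrative) inputs:
      T[a][s, s'] : transition probabilities
      O[a][s', o] : observation probabilities
      R[a][s]     : immediate reward
    """
    best_alpha, best_value = None, -np.inf
    for a in range(len(T)):
        alpha_a = R[a].astype(float)
        for o in range(O[a].shape[1]):
            # g_i(s) = sum_{s'} T[a][s, s'] O[a][s', o] alpha_i(s')
            g = np.array([T[a] @ (O[a][:, o] * alpha) for alpha in alpha_set])
            # Pick the successor alpha-vector that is best at this belief.
            alpha_a = alpha_a + gamma * g[np.argmax(g @ b)]
        value = alpha_a @ b
        if value > best_value:
            best_alpha, best_value = alpha_a, value
    return best_alpha
```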








Wasserstein Adaptive Value Estimation for Actor-Critic Reinforcement Learning

Baheri, Ali, Sharooei, Zahra, Salgarkar, Chirayu

arXiv.org Machine Learning

We present Wasserstein Adaptive Value Estimation for Actor-Critic (WAVE), an approach to enhance stability in deep reinforcement learning through adaptive Wasserstein regularization. Our method addresses the inherent instability of actor-critic algorithms by incorporating an adaptively weighted Wasserstein regularization term into the critic's loss function. We prove that WAVE achieves $\mathcal{O}\left(\frac{1}{k}\right)$ convergence rate for the critic's mean squared error and provide theoretical guarantees for stability through Wasserstein-based regularization. Using the Sinkhorn approximation for computational efficiency, our approach automatically adjusts the regularization based on the agent's performance. Theoretical analysis and experimental results demonstrate that WAVE achieves superior performance compared to standard actor-critic methods.
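The abstract's core ingredient, an adaptively weighted, Sinkhorn-approximated Wasserstein term added to the critic's loss, can be sketched roughly as below. The function names, the one-dimensional batch-distribution formulation, and the placeholder weight `lam` are assumptions for illustration, not the authors' code or their adaptation rule.

```python
import torch

def sinkhorn_distance(x, y, eps=0.1, n_iters=50):
    """Entropy-regularized (Sinkhorn) approximation of the Wasserstein distance
    between two 1-D empirical distributions x and y with uniform weights."""
    C = (x.view(-1, 1) - y.view(1, -1)) ** 2           # pairwise squared-distance cost
    K = torch.exp(-C / eps)                             # Gibbs kernel
    a = torch.full((x.numel(),), 1.0 / x.numel())
    b = torch.full((y.numel(),), 1.0 / y.numel())
    u, v = torch.ones_like(a), torch.ones_like(b)
    for _ in range(n_iters):                            # Sinkhorn iterations
        u = a / (K @ v)
        v = b / (K.t() @ u)
    P = u.view(-1, 1) * K * v.view(1, -1)               # approximate transport plan
    return (P * C).sum()

def wave_critic_loss(q_pred, td_target, lam):
    """Hypothetical critic objective: TD mean-squared error plus an adaptively
    weighted Sinkhorn regularizer between the batch distributions of predictions
    and targets; `lam` stands in for the adaptive weight described in the abstract."""
    mse = torch.mean((q_pred - td_target) ** 2)
    return mse + lam * sinkhorn_distance(q_pred, td_target.detach())
```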


Simulation-Based Optimistic Policy Iteration For Multi-Agent MDPs with Kullback-Leibler Control Cost

Nakhleh, Khaled, Eksin, Ceyhun, Ekin, Sabit

arXiv.org Artificial Intelligence

This paper proposes an agent-based optimistic policy iteration (OPI) scheme for learning stationary optimal stochastic policies in multi-agent Markov Decision Processes (MDPs), in which agents incur a Kullback-Leibler (KL) divergence cost for their control efforts and an additional cost for the joint state. The proposed scheme consists of a greedy policy improvement step followed by an m-step temporal difference (TD) policy evaluation step. We use the separable structure of the instantaneous cost to show that the policy improvement step follows a Boltzmann distribution that depends on the current value function estimate and the uncontrolled transition probabilities. This allows agents to compute the improved joint policy independently. We show that both the synchronous (entire state space evaluation) and asynchronous (a uniformly sampled set of substates) versions of the OPI scheme with finite policy evaluation rollout converge to the optimal value function and an optimal joint policy asymptotically.
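As a rough single-agent illustration of the two steps described above, a Boltzmann-form policy improvement built from the uncontrolled transition kernel and the current value estimate, followed by an m-step TD-style evaluation, one might write something like the following. The multi-agent factorization, the asynchronous sampling of substates, and the exact rollout scheme are simplified away; all names and parameters are assumptions, not the paper's notation.

```python
import numpy as np

def kl_policy_improvement(V, P0):
    """Greedy improvement under a KL control cost: the improved policy is a
    Boltzmann reweighting of the uncontrolled kernel P0 by exp(-V).
    P0[s, s'] : uncontrolled transition probabilities; V : current value estimate."""
    weights = P0 * np.exp(-V)[None, :]                 # unnormalized Boltzmann weights
    return weights / weights.sum(axis=1, keepdims=True)

def m_step_td_evaluation(V, policy, state_cost, P0, m=5, alpha=0.1, gamma=1.0):
    """m-step TD-style evaluation of the improved policy, where the instantaneous
    cost is the joint-state cost plus the KL cost of deviating from P0.
    (Synchronous full-sweep version; the paper also covers an asynchronous variant.)"""
    kl_cost = np.sum(policy * np.log(np.clip(policy / P0, 1e-12, None)), axis=1)
    for _ in range(m):
        # Expected one-step cost plus expected next value under the improved policy.
        target = state_cost + kl_cost + gamma * (policy @ V)
        V = V + alpha * (target - V)
    return V
```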